Data Visualization Notes

Make a Plot

## General strategy: First define the object for ggplot and assign it a name.
## Then add geom_*'s etc. additively and display immediately.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp))

## A first version of a relatively polished plot
p + geom_point(alpha = 0.3) + 
    geom_smooth(method = "lm") + 
    scale_x_log10(labels = scales::dollar) + 
    labs(x = "GDP Per Capita", y = "Life Expectancy in Years", 
         title = "Economic Growth and Life Expectancy",
         subtitle = "Data points are country-years",
         caption = "Source: Gapminder")

The color aesthetic affects the appearance of lines and points, the fill aesthetic controls the appearance of the filled areas of bars, polygons or, in this case, the interior of the smoothers’s standard error ribbons.

## Note that within the curly brackets we can set options to be different from the default.
p <- ggplot(data = gapminder, mapping = aes(x = gdpPercap, y = lifeExp, 
    color = continent, fill = continent))

p_out <- 
  p + geom_point(alpha = 0.3) + 
    geom_smooth(method = "lm") + 
    scale_x_log10(labels = scales::dollar) + 
    labs(x = "GDP Per Capita", y = "Life Expectancy in Years", 
         title = "Economic Growth and Life Expectancy",
         subtitle = "Data points are country-years",
         caption = "Source: Gapminder")

## Save via ggsave and, if the plot is assigned a name, specify plot = ...
## Vector based graphs (.pdf / .svg) are easier to resize, but larger than raster based graphs (.png / .jpg)
ggsave(filename = here("Kieran Healy, Data Visualization Book", "figures", "gdp_lifeexp_chapter3.pdf"), plot = p_out, 
       height = 9, width = 16, units = "in")

Show the Right Numbers

Use facet_wrap() to split data by group and create separate mini-plots for each group.

p <- ggplot(data = gapminder, aes(x = year, y = gdpPercap))
p + geom_line(color = "gray70", aes(group = country)) +
    geom_smooth(size = 1.1, method = "loess", se = FALSE) + 
    scale_y_log10(labels = scales::dollar) + 
    facet_wrap(facets = vars(continent), ncol = 5) +
    labs(x = "Year", y = "GDP per capita", 
         title = "GDP per capita on Five Continents")

Alternatively, you can use facet_grid() to split groups into sub-plots with more control, along a true 2-dimensional grid. To illustrate, we use data from the gss_sm dataset which contains questions from the 2016 General Social Survey.

p <- ggplot(data = gss_sm, aes(x = age, y = childs))
p + geom_point(alpha = 0.2) +
    geom_smooth() +
    facet_grid(rows = vars(sex), cols = vars(race))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 18 rows containing non-finite values (stat_smooth).
## Warning: Removed 18 rows containing missing values (geom_point).

stat_* functions and their relation to geom’s

Every geom_* function has an asscociated stat_* function that it uses by default. The reverse is also true, every stat_* function has an associated geom_* function that it will plot by default. For example, the geom_bar() function automatically calls the stat_count() function which calculates the values for the count and the proportion. We can then call these values in the mapping through y = ..prop...

p <- ggplot(data = gss_sm, mapping = aes(x = bigregion))
## Note that we need to add group = 1 to tell R to ignore the grouping by the x-variable
## when calculating the proportions.
p + geom_bar(aes(y = ..prop.., group = 1))

## If we want to additionally color the bars according to their category we use the
## fill aesthetic (the color aesthetic would only create a colored border).
## Note that the reader needs no legend, because the categories are already encoded
## in the x axis. We can turn of guidance for an aesthetic with the guides() function.
p + geom_bar(aes(fill = bigregion)) +
    guides(fill = FALSE)

## Let's display what the proportion of each religion within each state is.
## Of all people in the Northeast, what proportion are Catholic, etc... We can do 
## this two ways, by using the same bar diagram and positioning the different religions
## side-by-side...
p <- ggplot(data = gss_sm, mapping = aes(x = religion, fill = bigregion))
p + geom_bar(position = "dodge", mapping = aes(y = ..prop.., group = bigregion))

## While this works, it makes it unintuitive to compare bars across different groups.
## It is also not clear immediately which group (religion or region) sum up to 1.
## In this case, it is probably easier to let the bar's divide by religion and facet
## by region. We then do not need the fill aesthetic.

p <- ggplot(data = gss_sm, mapping = aes(x = religion))
p + geom_bar(mapping = aes(y = ..prop.., group = bigregion)) + facet_wrap(vars(bigregion))

The identity stat and position

You can call stat = "identity to say that ggplot should perform no calculation on the data. This is helpful to display data as is without geom_*’s performing their default calculations of summary statistics.

p <- ggplot(data = titanic, aes(x = fate, y = percent, fill = sex))
p + geom_bar(position = "dodge", stat = "identity")

## An alternative is the geom_col() function which is the same as geom_bar with the
## default stat = "identity".
# p + geom_col(position = "dodge")

Similarly, one can set the position argument to position = "identity" to tell ggplot to display the data exactly where it is. This can be used (with ordered data) to use bar plots as alternatives to line plots.

p <- ggplot(data = oecd_sum, aes(x = year, y = diff, fill = hi_lo))
p + geom_col() +
    guides(fill = FALSE) +
    labs(x = NULL, y = "Difference in Years", 
         title = "The US Life Expectancy Gap",
         subtitle = "Difference between US and OECD average life expectancies, 1960-2015",
         caption = "Data: OECD. After a chart by Christopher Ingraham, Washington Post, December 27th 2017.")
## Warning: Removed 1 rows containing missing values (position_stack).

### Exploring alternatives to the histogram

p <- ggplot(data = gapminder, aes(x = gdpPercap))
p + geom_bin2d(aes(y = lifeExp), bins = c(25, 20)) +
    scale_x_log10(labels = scales::dollar)

Graph Tables, Make Labels, Add Notes

rel_by_region <- gss_sm %>% group_by(bigregion, religion) %>% 
  summarise(N = n()) %>% mutate(freq = N / sum(N), pct = round((freq*100), 0))
## `summarise()` regrouping output by 'bigregion' (override with `.groups` argument)
## As a rule, dodged charts can be more cleanly expressed as faceted plots and 
## to prevent long and overlapping labels, coord_flip() is the way to go.
p <- ggplot(data = rel_by_region, mapping = aes(x = religion, y = pct, fill = religion))
p + geom_col(position = "dodge2") +
    labs(x = NULL, y = "Percent", fill = "Religion") +
    guides(fill = FALSE) +
    coord_flip() +
    facet_grid(cols = vars(bigregion))

Boxplots are a good way of displaying variation within groups, separately for different groups. Just like stat_bar(), geom_boxplot() and its connected stat_boxplot() calculate summary statistics to display the median, quantiles and outliers.

## By default, geom_boxplot orders values on the x-Axis alphabetically, but this is
## usually not the most sensible order. Better to order good on top to bad on bottom.
## You can reorder() by mean (default), median, sd. Here median (consistent with black line in
## boxplot) or mean (consistent with center of mass of box of boxplot) are good.
p <- ggplot(data = organdata, mapping = aes(x = reorder(country, donors, na.rm = T), 
                                            y = donors, fill = world))
p + geom_boxplot() +
    labs(x = NULL) +
    coord_flip() +
    theme(legend.position = "top")
## Warning: Removed 34 rows containing non-finite values (stat_boxplot).

## If there are relatively few observations per group (here country), it is desirable
## to (also) show the raw data with geom_point(). Use jitter to perturb points and 
## avoid overplotting.
p + geom_jitter(aes(fill = NULL, color = world), position = position_jitter(width = 0.15)) +
    labs(x = NULL) +
    coord_flip() +
    theme(legend.position = "top")
## Warning: Removed 34 rows containing missing values (geom_point).

Cleveland dotplots. The better bar- or boxplot.

If you are displaying only one point per category (or a summary statistic) it is often more appearling to use a Cleveland dotplot over a bar plot or a dot-and-whisker plot over a boxplot. To do this, you first transform the dataset and create the mean and standard deviation manually and then only plot the mean (as dots) and the standard deviation (as whiskers).

## Note that summarise changed compared to the version in the book. Now scoping
## over variables is done via multiple across() calls.
## The .names below is the default, but would allow for more flexibility. Functions
## are passed as named list.
by_country <- organdata %>% group_by(consent_law, country) %>% 
  summarise(across(is.numeric, list(mean = mean, sd = sd), na.rm = T, .names = "{col}_{fn}"), 
            across(world, first)) %>% 
  ungroup()
## Warning: Predicate functions must be wrapped in `where()`.
## 
##   # Bad
##   data %>% select(is.numeric)
## 
##   # Good
##   data %>% select(where(is.numeric))
## 
## ℹ Please update your code.
## This message is displayed once per session.
## `summarise()` regrouping output by 'consent_law' (override with `.groups` argument)
p <- ggplot(by_country, aes(x = donors_mean, y = reorder(country, donors_mean), color = consent_law))
p + geom_point(size = 3) +
    labs(x = "Donor procurement rate", y = NULL, color = "Consent Law") + 
    theme(legend.position = "top")

# Add whiskers to show the standard deviation with the geom_pointrange geom. Since
p + geom_point(size = 3) + 
    geom_pointrange(aes(xmin = donors_mean - donors_sd, xmax = donors_mean + donors_sd)) +
    labs(x = "Donor procurement rate", y = NULL, color = "Consent Law") +
    theme(legend.position = "top")

## Instead of color coding the type of consent law, it might be better to facet the plot.
## In that case, keep the y axis free floating to not duplicate country rows.
p <- ggplot(by_country, aes(x = donors_mean, y = reorder(country, donors_mean)))
p + geom_point(size = 3) +
    facet_wrap(vars(consent_law), scale = "free_y", ncol = 1) +
    labs(x = "Donor procurement rate", y = NULL) 

Adding text to plots.

## The ggrepel package and its geom_text_repel() function performs better than the
## geom_text() function and does not overlay labels. 

library(ggrepel)

p <- ggplot(elections_historic, aes(x = popular_pct, y = ec_pct, label = winner_label))
p + geom_vline(xintercept = 0.5, size = 1.4, color = "grey80") +
    geom_hline(yintercept = 0.5, size = 1.4, color = "grey80") +
    geom_point() +
    geom_text_repel() +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    labs(title = "Presidential elections: Popular & electoral college margins",
         subtitle = "1824-2016",
         caption = "Data for 2016 is provisional.",
         x = "Winner's share of popular vote", 
         y = "Winner's share of electoral college votes",
         label = NULL) 

## Usually, we would want to display labels only for certain points.

p <- ggplot(data = by_country, aes(x = gdp_mean, y = health_mean))
p + geom_point() +
    geom_text_repel(data = subset(by_country, gdp_mean > 25000), 
                    aes(label = country))

organdata$ind <- organdata$ccode %in% c("Ita", "Spa") & organdata$year > 1998
p <- ggplot(organdata, aes(x = roads, y = donors))
p + geom_point(aes(color = ind)) +
    geom_text_repel(data = subset(organdata, ind), mapping = aes(label = ccode)) +
    guides(color = FALSE, label = FALSE)
## Warning: Removed 34 rows containing missing values (geom_point).

## It looks nice to fade unlabelled points with alpha. Can be done in aesthetic, but
## maybe more robust to do it in geom_point(). geom_text_repel can create labels 
## based on a combination of values from various columns which is cool!
p + geom_point(data = subset(organdata, ind), alpha = 1) +
    geom_point(data = subset(organdata, !ind), alpha = 0.2) +
    geom_text_repel(data = subset(organdata, ind), mapping = aes(label = paste(ccode, lubridate::year(year), sep = " "))) +
    guides(label = FALSE) +
    labs(caption = "Spain and Italy after 1998 are highlighted, other countries are faded out.")
## Warning: Removed 30 rows containing missing values (geom_point).

In the previous examples we added labels to the graph, were the labels had to be (a combination of) existing values in the dataset, like country-year pairs. Instead we can add commentary to a plot by annotating it manually. annotate() simply tells R to add something to the plot. We specify what should be added with the geom = ... option.

p <- ggplot(organdata, aes(x = roads, y = donors))
p + geom_point() + 
    annotate(geom = "text", label = "A surprisingly high \n recovery rate.",
             x = 157, y = 33, hjust = 0) +
    annotate(geom = "rect", fill = "red", alpha = 0.2,
             xmin = 125, xmax = 155, ymin = 30, ymax = 35) + 
    annotate(geom = "curve", x = 175, y = 31.5, xend = 154, yend = 30, curvature = -0.5, color = "red", alpha = 0.7, arrow = arrow(length = unit(0.5, "cm")))
## Warning: Removed 34 rows containing missing values (geom_point).

### Exercises

elections_historic <- elections_historic %>% filter(win_party %in% c("Dem.", "Rep.")) %>% 
  mutate(ind = year > 1980)
party_colors <- c("#2E74C0", "#CB454A")

p <- ggplot(data = elections_historic, aes(x = popular_pct, y = ec_pct, color = win_party))
p + geom_vline(xintercept = 0.5, size = 1.4, color = "grey80") +
    geom_hline(yintercept = 0.5, size = 1.4, color = "grey80") +
    geom_point(data = subset(elections_historic, !ind), alpha = 0.4) +
    geom_point(data = subset(elections_historic, ind)) +
    geom_text_repel(data = subset(elections_historic, ind), aes(label = winner_label)) +
    scale_x_continuous(labels = scales::percent) +
    scale_y_continuous(labels = scales::percent) +
    scale_color_manual(labels = c("Democrat", "Republican"), values = party_colors) +
    guides(label = F) +
    labs(title = "Presidential elections: Popular & electoral college margins",
         subtitle = "1824-2016",
         caption = "Data for 2016 is provisional.",
         x = "Winner's share of popular vote", 
         y = "Winner's share of electoral college votes",
         color = "",
         label = NULL) +
    theme(legend.position = "top")

party_colors <- c("#2E74C0", "#CB454A")

p <- ggplot(subset(elections_historic, year > 1940), aes(x = popular_pct, y = reorder(winner_label, popular_pct), color = win_party))
p + geom_vline(xintercept = 0.5) + 
    geom_point() +
    guides(color = F) +
    scale_x_continuous(labels = scales::percent) +    
    scale_color_manual(values = party_colors) +
    labs(x = "Winner's share popular vote", y = NULL) +    
    guides(color = FALSE) +
    annotate(geom = "curve", xend = 0.508, yend = 8.75, x = 0.53, y = 7, 
             arrow = arrow(length = unit(0.5, "cm")), curvature = -0.3) +
    annotate(geom = "text", label = "In 2004, George Bush won Florida \n only after a controversial decision \n of the Supreme Court.",
             x = 0.53, y = 7, hjust = 0)

Work with Models

Draw Maps

Refine your Plots